Our motivation in choosing this dataset for further analysis stems from questions we hope to answer:
Do the various predicting factors chosen initially really affect life expectancy?
Which predicting variables actually affect life expectancy?
Should a country with a lower life expectancy value (<65) increase its healthcare expenditure in order to improve its average lifespan? How do infant and adult mortality rates affect life expectancy?
Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, alcohol consumption, etc.?
What is the impact of schooling on the lifespan of humans?
Does life expectancy have a positive or negative relationship with alcohol consumption?
Do densely populated countries tend to have lower life expectancy?
What is the impact of Immunization coverage on life Expectancy?
Does the sample provide enough evidence to say that developed countries have a higher average life expectancy than developing countries?
Do the countries that spend a higher proportion of their resources on human development have a higher life expectancy?
What is the most frequent range of life expectancy?
For this project, we obtained the Life Expectancy dataset from Kaggle (link). The health factors data was collected from the WHO data repository website, and the corresponding economic data was obtained from the United Nations website with the assistance of Deeksha Russell and Duan Wang. The dataset covers 193 countries between the years 2000-2015 and consists of 2938 observations and 22 attributes, of which 20 are meant to be predicting variables. These predicting variables have been divided into several broad categories, including immunization-related factors, mortality factors, economic factors, and social factors.
20 real-valued features:
It’s worth mentioning that the data frame contains some missing values for attributes such as Hepatitis B, Alcohol, GDP, and others. Additionally, some countries, such as Vanuatu, Tonga, Togo, Cabo Verde, etc., have been excluded from the dataset because they had too many missing values, which would negatively impact the result.
In this step, we examine the presence of missing values in our dataset and perform necessary preprocessing. Approximately 43.87% of the dataset contains missing values, which is nearly half of the data. It is important to analyze how missing values are distributed across different attributes.
Upon analyzing the data, we found that certain attributes have a significant number of missing values. The highest number of missing values is observed in the attributes of Population, GDP, Hepatitis B, followed by Total Expenditure, Alcohol, Income Composition of Resources, and Schooling.
The attribute with the most missing values is Population. Dealing with missing data presents several challenges, and one approach is to remove the missing entries. However, in our case, this would result in discarding a substantial portion of our dataset, which could adversely affect the accuracy of future predictions. Another option is to replace missing values with the mean or median of the Population variable. However, due to the wide range of values in this feature, such an approach would introduce inaccuracies.
After careful consideration, we decided to conduct further research and obtained actual values for most of the missing data from The World Bank website (link). The next step involves importing a new dataset and replacing the missing values with the data retrieved from The World Bank site.
We successfully obtained actual values for a substantial portion of the missing data, and as a result, the number of missing values in the Population attribute has been reduced to 50 entries. Consequently, we can now proceed to remove these remaining missing values from our dataset.
After successfully addressing the missing values in the Population and GDP attributes, we still have other attributes that contain missing values. For these attributes, we were unable to find replacement values in external sources. Hence, we need to employ different techniques to handle them, such as imputation with the mean, imputation with the median, or combined imputation, depending on the nature of the data.
To handle the missing data, we will utilize mean or median imputation. Mean imputation is suitable for attributes that follow a normal or approximately symmetric distribution without significant outliers. On the other hand, median imputation is more appropriate for attributes with skewed distributions and significant outliers. To assess the distribution of each attribute, we will plot histograms for visualization.
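The analysis in this report is carried out in R, but the two imputation rules themselves can be sketched in Python with pandas (the column names and values below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins: a roughly symmetric column and a skewed column with an outlier
schooling = pd.Series([10.0, 12.0, None, 11.0, None, 13.0])
alcohol = pd.Series([1.0, 2.0, None, 3.0, 50.0])

# Mean imputation for approximately symmetric attributes without big outliers
schooling_filled = schooling.fillna(schooling.mean())

# Median imputation for skewed attributes, where the mean would be dragged
# toward the extreme value (here 50.0)
alcohol_filled = alcohol.fillna(alcohol.median())

print(schooling_filled.tolist())
print(alcohol_filled.tolist())
```

Note how the median (2.5) ignores the extreme 50.0, while the mean of the same column would not.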
Based on the previous histograms, we can assume that attributes such as Life Expectancy, Total Expenditure, Income Composition of Resources, and Schooling exhibit a bell-shaped normal curve. Therefore, these attributes are potential candidates for mean imputation. However, before applying mean imputation, we also need to examine the presence of outliers in each attribute to ensure that our imputation process is not influenced by extreme values.
To identify outliers in each attribute, we will utilize boxplots. Upon analyzing the previous boxplots, we observe that out of the four candidates identified earlier, only Schooling and Income Composition of Resources are more suitable for mean imputation. For the remaining 15 attributes, we have decided to proceed with median imputation.
Furthermore, we check the number of NA values in the dataset again to gain a better understanding of the extent of missing data.
In this stage, we have confirmed that our dataset does not contain any missing values (NA-values). Hence, we can proceed to the next step of processing outliers. There are several methods available for outlier detection, including visual techniques like boxplots and histograms, as well as statistical methods such as Tukey’s Method. Once outliers have been identified using these methods, it is important to preprocess them accordingly. Several techniques can be employed for outlier preprocessing, including:
Dropping outliers: One approach is to remove the outliers from the dataset entirely, excluding them from subsequent analyses. This can be appropriate when outliers are deemed as data errors or extreme values that do not align with the overall pattern of the data.
Limiting/Winsorizing outliers: Instead of eliminating outliers, this technique involves capping or replacing outlier values with predefined limits. By setting a threshold, the extreme values are brought within an acceptable range while retaining their relative position in the distribution. Winsorizing is a common variation of this approach.
Transforming the data: Another strategy is to apply mathematical transformations to the data, such as taking logarithms, inverses, square roots, or other suitable transformations. These transformations can help normalize the distribution and mitigate the impact of outliers on subsequent analyses.
Our plan is to describe and apply these three methods. For each method, we will evaluate the results of the model. The final decision on the best method will be based on the performance of the models. The first step is to plot boxplots for each attribute in our dataset, which will help us identify potential outliers.
| Metric | Number of outliers | Percent of data that is outlier |
|---|---|---|
| Life.expectancy | 19 | 0.67 |
| Adult.Mortality | 88 | 3.09 |
| infant.deaths | 329 | 11.56 |
| Alcohol | 2 | 0.07 |
| percentage.expenditure | 369 | 12.97 |
| Measles | 521 | 18.31 |
| under.five.deaths | 0 | 0 |
| Hepatitis.B | 313 | 11 |
| HIV.AIDS | 546 | 19.19 |
| BMI | 0 | 0 |
| Polio | 259 | 9.1 |
| Total.expenditure | 42 | 1.48 |
| Diphtheria | 282 | 9.91 |
| GDP | 491 | 17.26 |
| Population | 405 | 14.24 |
| thinness..1.19.years | 109 | 3.83 |
| thinness.5.9.years | 106 | 3.73 |
| Income.composition.of.resources | 116 | 4.08 |
| Schooling | 62 | 2.18 |
Upon examining the attributes, we observed that BMI is the only attribute that does not exhibit any outliers. However, attributes such as Alcohol, Life Expectancy, and Income Composition of Resources contain a relatively small number of outliers. Consequently, completely dropping these outliers might not be the most appropriate solution for this particular problem.
To address this, we will begin by utilizing the z-score method for identifying and removing outliers.
Implementing the z-score method will allow us to effectively handle the outliers in the dataset, ensuring that they do not unduly influence the subsequent analyses.
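As a sketch of the z-score rule (in Python with NumPy; the sample values and the conventional cutoff of 3 are illustrative assumptions, not figures from the report):

```python
import numpy as np

def remove_zscore_outliers(x, threshold=3.0):
    # Drop points whose |z-score| exceeds the threshold (3 is a common choice)
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) < threshold]

# 20 typical values plus one extreme observation
data = np.array([50, 51, 49, 52, 48, 50, 51, 49, 50, 52,
                 48, 51, 49, 50, 51, 49, 50, 52, 48, 50, 120], dtype=float)
cleaned = remove_zscore_outliers(data)
print(len(data), len(cleaned))  # the extreme value 120 is removed
```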
Tukey’s Fences

Tukey’s method, also known as Tukey’s fences, defines upper and lower bounds based on the interquartile range (IQR). Data points that fall beyond these bounds are considered potential outliers. This method provides a robust approach to identifying outliers, as it is less sensitive to extreme values.
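A minimal Python sketch of the fences (using NumPy; the sample values and the conventional k = 1.5 multiplier are illustrative):

```python
import numpy as np

def tukey_fences(x, k=1.5):
    # Fences at Q1 - k*IQR and Q3 + k*IQR; points outside are flagged as outliers
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

x = np.array([60, 62, 63, 64, 65, 66, 67, 68, 70, 95], dtype=float)
lo, hi = tukey_fences(x)
kept = x[(x >= lo) & (x <= hi)]
print(lo, hi, len(kept))  # 95 falls outside the upper fence and is dropped
```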
Comparing these two models, we can see that the second model, which uses the original dataset, has a higher adjusted R-squared value (0.8168) than the first model (0.7602). This indicates that the second model explains a greater proportion of the variance in the dependent variable (life expectancy) based on the independent variables.
On the other hand, the second model has a higher residual standard error (4.042 versus 2.704), suggesting that its predictions show more variability around the regression line. Nevertheless, comparing the IQR method and the z-score method by adjusted R-squared, the second one comes out ahead.
Imputation + outliers
Winsorization

Winsorization is another outlier preprocessing technique that involves replacing extreme outlier values with less extreme values within a predefined range. This approach mitigates the impact of outliers while retaining the relative position of the data points in the distribution.
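A sketch of percentile-based winsorizing in Python (the 5th/95th percentile limits and the sample values are illustrative choices, not the limits used in the report):

```python
import numpy as np

def winsorize(x, lower_pct=5, upper_pct=95):
    # Cap values below/above the chosen percentiles instead of dropping them
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

x = np.array([1, 50, 52, 53, 54, 55, 56, 57, 58, 200], dtype=float)
w = winsorize(x)
print(w.min(), w.max())  # the extremes 1 and 200 are pulled inside the range
```

Unlike dropping, every observation keeps a row in the dataset; only the extreme values are capped.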
Let’s examine the Adjusted R-squared values, as they account for the number of predictors in each model:
| Model | Adjusted R-squared |
|---|---|
| data_winsorize | 0.8722 |
| data_z | 0.8638 |
| data_Tukey | 0.7602 |
| data_impute_out | 0.8159 |
Based on the Adjusted R-squared values, we can observe that the data_winsorize and data_z models have the highest values among the compared models, with Adjusted R-squared values of 0.8722 and 0.8638, respectively. These higher values indicate that these models explain a larger proportion of the variance in the dependent variable compared to the other models.
However, in addition to the Adjusted R-squared values, we also need to consider the distribution of the life expectancy variable after applying each method. In our research, the assumption of a normal distribution is crucial, and winsorization, the technique used in the data_winsorize model, may disturb the normality of the life expectancy distribution.
Taking this into consideration, we will choose the data_z model as it also provides a high Adjusted R-squared value (0.8638) while preserving the normality assumption of the life expectancy distribution.
Before proceeding with the data exploration phase, it is important to perform necessary preprocessing steps to ensure accurate and meaningful analysis. One such step is the factorization of the Status variable.
Since our analysis focuses on predicting Life Expectancy, it is crucial to ensure that this attribute follows a normal distribution.
In order to improve the distribution and address any potential skewness, we applied a square root transformation to the Life Expectancy values. This transformation helps in normalizing the data and reducing the impact of extreme values.
## Skewness: 0.0665222231245973
## kurtosis : 2.941852
The skewness value of about 0.067 is close to zero, indicating that the distribution is approximately symmetric. A skewness value close to zero suggests that the data is relatively normally distributed, or very close to it, in terms of symmetry.
The kurtosis value of 2.94 is slightly less than 3. This suggests that the distribution has slightly lighter tails and a slightly flatter peak compared to a normal distribution. However, a kurtosis value of 2.94 is still very close to 3, indicating that the distribution is not significantly different from a normal distribution in terms of its tail behavior.
Overall, based on the skewness and kurtosis values provided, the data appears to have a reasonably symmetric distribution and is relatively close to a normal distribution in terms of both skewness and kurtosis.
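The skewness and kurtosis above were computed in R; an equivalent check in Python on simulated normal data (so the exact numbers differ from the report) could be:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(5)
x = rng.normal(size=5000)  # stand-in for the transformed Life Expectancy values

s = skew(x)
# fisher=False reports Pearson kurtosis, where a normal distribution is ~3,
# matching the convention used in the output above
k = kurtosis(x, fisher=False)
print(round(s, 3), round(k, 3))
```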
Another important aspect of data preprocessing is scaling the variables. It enables us to compare and analyze variables with different scales and units without any dominance based on their magnitudes. Scaling is particularly beneficial when working with algorithms that are sensitive to variable scales, such as regression models or distance-based algorithms.
By scaling the variables, we enhance interpretability and facilitate comparison. Scaling also aids in visualizing the data and identifying patterns or relationships between variables more effectively and ensuring consistency in scale across all variables.
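The scaling used here is standardization (z-score scaling); a minimal NumPy sketch with made-up magnitudes:

```python
import numpy as np

def standardize(x):
    # z-score scaling: subtract the mean, divide by the standard deviation,
    # giving mean 0 and standard deviation 1
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

gdp_like = np.array([1000.0, 2000.0, 3000.0, 4000.0])  # arbitrary large-scale values
scaled = standardize(gdp_like)
print(scaled.mean().round(10), scaled.std().round(10))
```

After this transformation, attributes measured in dollars, percentages, or years all live on the same scale.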
By performing factorization of the Status variable, transforming the Life.expectancy variable, and finally scaling the numeric variables, we have prepared the dataset for further exploration and analysis, setting the stage for uncovering meaningful insights and relationships within the data.
Here we want to use both univariate and bivariate analysis methods. Our goals:
Exploring the relationship between continuous variables and the target variable (life expectancy) as well as their interrelationships.
Investigating the impact of categorical variables on the target variable (life expectancy).
Examining the relationship between the variables “Country Status” and “Year” with continuous variables. Note that due to the dataset containing a large number of countries with small sample sizes, making country-to-country comparisons may not provide significant insights.
Univariate analysis looks at the data for each variable on its own. This is generally best done using histograms for continuous data, count/bar plots for categorical data and, of course, descriptive statistics obtained with summary().
As you can see, Life expectancy, Total expenditure, Income.composition.of.resources and Schooling appear to have normal distributions.
Let’s check normality with Q-Q plots.
This scatter plot shows that ‘Schooling’, ‘Income composition of resources’ and ‘BMI’ have a strong positive correlation with Life Expectancy. On the other hand, ‘Adult Mortality’ and ‘HIV/AIDS’ have a negative correlation with Life Expectancy.
In our analysis, we used the correlation matrix to explore the relationships among the scaled variables. The matrix was visualized as a heatmap, where darker or lighter shades indicated stronger correlations. This allowed us to identify clusters or groups of variables that were highly correlated. By calculating the correlation coefficients between pairs of variables, we gain insights into the strength and direction of their linear associations.
After scaling the variables, we examined the correlations between them and identified several weakly correlated pairs.
While these variables exhibit some correlation, it does not necessarily indicate collinearity among independent variables. Collinearity refers to a high degree of correlation between independent variables, which can pose challenges in statistical analysis, particularly in linear models.
To assess the potential collinearity, we recommend calculating the Variance Inflation Factor (VIF) for the variables in a linear model. The VIF helps identify if collinearity is present by quantifying the inflation in the variances of the regression coefficients. A VIF value exceeding 5 suggests a problem with collinearity, indicating that one or more variables are highly correlated with each other.
If high VIF values are observed, it is advisable to address the collinearity issue by eliminating one of the correlated variable pairs. This step helps mitigate the impact of collinearity and improves the stability and interpretability of the regression model.
It’s important to note that weak correlations between variables do not necessarily indicate collinearity. Therefore, conducting further analysis, such as calculating the VIF, is crucial to identify and address any potential collinearity issues in the dataset.
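The VIF of a predictor is 1/(1 − R²), where R² comes from regressing that predictor on all the others; a self-contained NumPy sketch on synthetic data (the 0.05 noise level is an arbitrary choice to manufacture near-collinearity):

```python
import numpy as np

def vif(X, j):
    # VIF_j = 1 / (1 - R^2) from regressing column j on the other columns
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other predictors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 / (ss_res / ss_tot)  # 1 - R^2 equals ss_res / ss_tot

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)               # independent of the others
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 1) for j in range(3)])  # x1 and x2 far exceed the VIF>5 rule
```

In practice one would use a packaged implementation (e.g. the car package in R), but the computation is exactly this auxiliary regression.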
After observing the correlation matrix, it becomes evident that three pairs of variables display high Variance Inflation Factors (VIFs). In order to address this issue of collinearity, we will proceed by omitting one variable from each correlated pair based on their VIF values.
Specifically, the variables “infant.deaths” and “under.five.deaths” exhibit VIFs that significantly exceed the threshold of 5, indicating strong collinearity. To resolve this, we will remove the variable “under.five.deaths”, since it possesses the higher VIF value.
Similarly, we will eliminate the variable “Income.composition.of.resources” due to its higher VIF value.
Furthermore, we will exclude the variable “thinness.5.9.years” as it demonstrates a higher VIF value.
The updated version of the data, after removing these three features, will be saved as “data_EDA.” This modified dataset will be utilized as a dataframe for further analysis in the “Model_EDA” section.
The correlation matrix now shows no suspicious coefficients that might indicate collinearity between the features. Upon closer examination, it becomes evident that the VIF values fall within an acceptable range, all being below the threshold of 5. With this observation, we can confidently state that data_EDA is now prepared and suitable for further analysis in the Model_EDA phase.
To check whether the continuous variables influence Life Expectancy, we apply an ANOVA test to each variable.
For each variable we categorize countries into one of three categories, ‘Low’, ‘Medium’, or ‘High’, depending on the country’s average for that feature.
First we group the data by country and find the average life expectancy over the 16 years and we compute the average for the feature we want to test.
We are going to get a new dataframe having average life and level of the tested feature (low, medium, or high) as columns and each row corresponding to one among the 193 countries in the dataset.
We then apply the ANOVA Test, where the null hypothesis is H0: mu_low = mu_medium = mu_high and the alternate hypothesis is that not all the means are equal.
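The tests below were run in R; the same one-way ANOVA is available in Python as scipy.stats.f_oneway. The three groups here are invented for illustration only:

```python
from scipy.stats import f_oneway

# Hypothetical per-country average life expectancy, split by feature level
low    = [55, 58, 60, 57, 59]
medium = [65, 67, 66, 68, 64]
high   = [75, 78, 76, 77, 79]

f_stat, p_value = f_oneway(low, medium, high)
print(f_stat, p_value)  # a small p-value rejects H0: mu_low = mu_medium = mu_high
```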
## Df Sum Sq Mean Sq F value Pr(>F)
## Adult.Mortality 2 9726 4863 168 <2e-16 ***
## Residuals 184 5326 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Alcohol 2 3338 1669.0 26.22 9.59e-11 ***
## Residuals 184 11713 63.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Percentage_Expenditure 2 6155 3077.3 63.65 <2e-16 ***
## Residuals 184 8897 48.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Hepatitis_B 2 1033 516.6 6.781 0.00144 **
## Residuals 184 14018 76.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Measles 2 2436 1217.8 17.76 8.85e-08 ***
## Residuals 184 12615 68.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## BMI 2 7029 3514 80.61 <2e-16 ***
## Residuals 184 8022 44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Total_Expenditure 2 1024 511.8 6.713 0.00153 **
## Residuals 184 14028 76.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## HIV 2 9467 4733 156 <2e-16 ***
## Residuals 184 5584 30
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## GDP 2 5962 2981.2 60.36 <2e-16 ***
## Residuals 184 9089 49.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Population 2 186 93.23 1.154 0.318
## Residuals 184 14865 80.79
## Df Sum Sq Mean Sq F value Pr(>F)
## Thinness_1.19 2 7238 3619 85.22 <2e-16 ***
## Residuals 184 7813 42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Thinness_5.9 2 5482 2741 52.71 <2e-16 ***
## Residuals 184 9569 52
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We will use the ANOVA test to assess the significance of the Human Development Index (HDI) for life expectancy. Here we will categorize countries into one of three categories, ‘Low’ (≤0.5), ‘Medium’ (>0.5 and ≤0.7), or ‘High’ (>0.7), depending upon the country’s average Income composition of resources (the dataset’s HDI proxy).
Firstly, we will group the data by country and find the average life expectancy and Income.composition.of.resources for each country over the 16 years.
## Df Sum Sq Mean Sq F value Pr(>F)
## Income_Decomposition_Resources 2 8389 4195 115.9 <2e-16 ***
## Residuals 184 6662 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see, the p-value is <0.05, so the countries with a higher income composition of resources for human development have significantly better life expectancy. Thus countries should spend more on human development to achieve a higher life expectancy.
Which test to use?
We will be using the ANOVA test to test the significance of education on life expectancy. Here we will categorize countries into one of the three categories: ‘Low’ (≤8), ‘Medium’(>8 and ≤12), ‘High’ (>12) depending upon the country’s average schooling years.
## Df Sum Sq Mean Sq F value Pr(>F)
## Education 2 74.73 37.37 76.04 <2e-16 ***
## Residuals 184 90.41 0.49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see, all the tests return a p-value lower than 0.05, except the one for Population. This means that all the variables except Population have an effect on Life Expectancy.
We will use a two-way ANOVA test. Here we will divide the countries into two categories for both Polio and Diphtheria. Countries having values of % immunization coverage for one-year-old greater than the median value will get category ‘High’ else ‘Low’.
Step 1: Countries with Polio (mean) coverage for one-year-olds ≤85 get the label ‘Low’, else ‘High’.
Step 2: Countries with Diphtheria (mean) coverage for one-year-olds ≤85 get the label ‘Low’, else ‘High’.
## Df Sum Sq Mean Sq F value Pr(>F)
## Polio 1 4346 4346 87.34 < 2e-16 ***
## Diphtheria 1 1548 1548 31.10 8.64e-08 ***
## Residuals 184 9157 50
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for both Polio and Diphtheria immunization coverage for one-year-olds is less than 0.05; hence we can say that immunization has a significant impact on life expectancy.
We will conduct a two-proportions z-test to compare the two independent proportions. Firstly, we will group the data by country and then find the average life expectancy, infant deaths, and under-five deaths for each country.
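A pooled two-proportion z-test can be sketched directly (Python; the counts 30/1000 and 41/1000 are assumed values chosen only to match the proportions 0.030 and 0.041 reported below, not figures taken from the dataset):

```python
import math

def two_prop_ztest(x1, n1, x2, n2):
    # Pooled two-proportion z-test, equivalent to R's prop.test without
    # continuity correction (the reported X-squared equals z^2)
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_prop_ztest(30, 1000, 41, 1000)
print(round(z, 3), round(p, 4))
```

With these assumed counts, z² ≈ 1.77, consistent with the X-squared statistic in the output below.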
##
## 2-sample test for equality of proportions without continuity correction
##
## data: arg1 out of arg2
## X-squared = 1.767, df = 1, p-value = 0.1838
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.027211995 0.005211995
## sample estimates:
## prop 1 prop 2
## 0.030 0.041
Since the p-value is greater than 0.05, we see no significant difference between the two independent proportions.
To assess the relationship between alcohol consumption and adult mortality rate, we can employ two approaches. First, we can create a scatter plot to visualize the data points and discern any potential correlation between the variables. Second, we can conduct a Pearson correlation test to quantitatively measure the correlation strength between alcohol consumption and adult mortality rate. By employing these methods, we can gain a deeper understanding of the association between these variables.
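The Pearson tests below were run in R; the Python equivalent is scipy.stats.pearsonr. The data here is simulated with a built-in negative trend, purely to show the mechanics:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
alcohol = rng.uniform(0, 15, size=100)                       # hypothetical averages
mortality = 200 - 5 * alcohol + rng.normal(0, 40, size=100)  # negative trend + noise

r, p_value = pearsonr(alcohol, mortality)
print(round(r, 2), p_value < 0.05)  # moderate negative correlation, significant
```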
##
## Pearson's product-moment correlation
##
## data: data3$Average_Adult_Mortality and data3$Average_Alcohol
## t = -3.8229, df = 185, p-value = 0.00018
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3986006 -0.1322244
## sample estimates:
## cor
## -0.2705838
##
## Pearson's product-moment correlation
##
## data: data2$Average_Life and data3$Average_Alcohol
## t = 6.6054, df = 185, p-value = 4.086e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3129765 0.5461110
## sample estimates:
## cor
## 0.4368508
The correlations between alcohol consumption and health indicators yield mixed results. While there is a positive correlation with life expectancy, there is a negative correlation with the adult mortality rate. These correlations, however, are not strong enough to draw firm conclusions, and further research is needed to better understand the complex relationship between alcohol consumption and these health outcomes.
To answer the first question, we need to see whether any of the variables have an effect on life expectancy. We look at the categorical features first, starting with ‘Status’.
First we divide the dataset into developing countries and developed countries and for each country we compute the mean of the Life Expectancy values obtained through the years.
First we want to check whether the variance of the developed countries is the same as the variance of the developing countries. For this we use an F-test.
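The F-test below was run in R (var.test); the same ratio-of-variances test can be sketched in Python with SciPy. The group means, spreads, and sizes here are hypothetical stand-ins, loosely modeled on the group sizes in the output that follows:

```python
import numpy as np
from scipy.stats import f as f_dist

def var_ratio_f_test(x, y):
    # Two-sided F-test for equality of variances: F = s_x^2 / s_y^2
    # compared against an F(n_x - 1, n_y - 1) distribution
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    F = x.var(ddof=1) / y.var(ddof=1)
    df1, df2 = len(x) - 1, len(y) - 1
    tail = f_dist.cdf(F, df1, df2) if F < 1 else f_dist.sf(F, df1, df2)
    return F, min(1.0, 2 * tail)

rng = np.random.default_rng(1)
developed = rng.normal(80, 2, size=32)     # hypothetical, tighter spread
developing = rng.normal(67, 8, size=147)   # hypothetical, wider spread
F, p = var_ratio_f_test(developed, developing)
print(round(F, 3), p < 0.05)  # ratio well below 1, variances differ
```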
##
## F test to compare two variances
##
## data: developed$Average_Life and developing$Average_Life
## F = 0.13941, num df = 31, denom df = 146, p-value = 2.241e-08
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.08406763 0.25570670
## sample estimates:
## ratio of variances
## 0.1394094
As we can see, the p-value is lower than 0.05, so we reject the null hypothesis and accept the alternative hypothesis that the variances of the two populations are different.
Now we want to see whether the developed countries have a higher average life expectancy than the developing countries. For this we use a two-sample t-test.
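Since the variances differ, the appropriate version is Welch's t-test, which the R output below reports. A Python sketch on simulated groups (means and spreads are hypothetical; the one-sided alternative= argument requires SciPy ≥ 1.6):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
developed = rng.normal(79.2, 3, size=32)    # hypothetical per-country means
developing = rng.normal(67.2, 8, size=147)

# Welch's t-test (equal_var=False), H1: mean(developed) > mean(developing)
t_stat, p_value = ttest_ind(developed, developing, equal_var=False,
                            alternative="greater")
print(round(t_stat, 2), p_value < 0.05)
```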
##
## Welch Two Sample t-test
##
## data: developed$Average_Life and developing$Average_Life
## t = 13.165, df = 134.02, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 10.49488 Inf
## sample estimates:
## mean of x mean of y
## 79.19785 67.19263
As we can see, the p-value is smaller than 0.05, so we reject the null hypothesis and accept the alternative hypothesis that the developed countries have a higher average life expectancy than the developing countries.
Based on the result of the above t-test, there appears to be a very significant difference between ‘Developing’ and ‘Developed’ countries with respect to their Life Expectancy. Since this is the case, a comparison between the status variable and all other continuous variables should be made before moving to the feature engineering phase.
–Status Variable Compared to other Continuous Variables–
Since the status variable only contains two different values, it is likely best to compare a number of descriptive statistics for those two values with respect to all the other continuous variables.
| Variable | p-value |
|---|---|
| Life.expectancy | 9.013849e-314 |
| Adult.Mortality | 1.249538e-164 |
| infant.deaths | 7.347845e-37 |
| Alcohol | 9.458846e-192 |
| percentage.expenditure | 9.887803e-38 |
| Measles | 1.717022e-16 |
| under.five.deaths | 1.596958e-38 |
| Hepatitis.B | 9.284856e-18 |
| HIV.AIDS | 3.950474e-61 |
| BMI | 6.883171e-65 |
| Polio | 8.299911e-75 |
| Total.expenditure | 3.593352e-36 |
| Diphtheria | 1.851033e-62 |
| GDP | 5.671048e-06 |
| Population | 0.2783308 |
| thinness..1.19.years | 2.116436e-301 |
| thinness.5.9.years | 2.732105e-296 |
| Income.composition.of.resources | 1.006483e-303 |
| Schooling | 7.433698e-206 |
Based on the results, it is evident that there are significant differences between developed and developing countries for these variables: every calculated p-value except that of Population falls below 0.05.
—-Life expectancy over the years—-
We aim to investigate the trend of life expectancy over the years.
While there appears to be a positive correlation between life expectancy and the passage of years, it is essential to determine whether the differences observed between each year are statistically significant. Are these differences substantial enough to consider them meaningful variations in life expectancy?
| Time Period | P-value |
|---|---|
| 2000 to 2001 | 0.7413284 |
| 2001 to 2002 | 0.8458235 |
| 2002 to 2003 | 0.949058 |
| 2003 to 2004 | 0.7579069 |
| 2004 to 2005 | 0.6189956 |
| 2005 to 2006 | 0.6110444 |
| 2006 to 2007 | 0.7044814 |
| 2007 to 2008 | 0.813566 |
| 2008 to 2009 | 0.6288716 |
| 2009 to 2010 | 0.8541843 |
| 2010 to 2011 | 0.5278882 |
| 2011 to 2012 | 0.7830066 |
| 2012 to 2013 | 0.7871843 |
| 2013 to 2014 | 0.7159781 |
| 2014 to 2015 | 0.9554911 |
Based on the results of the conducted t-tests, the p-values obtained for all comparisons between consecutive years are greater than 0.05, indicating that there is no significant evidence to support the presence of substantial differences in Life Expectancy between these years.
To apply linear regression we need to make sure that four conditions are satisfied: no strong collinearity among the predictors, a linear relationship between the predictors and the outcome, normally distributed residuals, and homoscedasticity of the residuals.
The first condition is already satisfied, as we removed the variables ‘infant.deaths’, ‘under.five.deaths’, ‘GDP’ and ‘thinness.5.9.years’, the variables with the highest VIF values and hence the strongest collinearity.
We start by building the linear model:
##
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + Status + Alcohol +
## percentage.expenditure + Hepatitis.B + Measles + BMI + Polio +
## Total.expenditure + Diphtheria + HIV.AIDS + Population +
## thinness..1.19.years + Income.composition.of.resources +
## Schooling, data = data_EDA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4078 -0.2363 0.0225 0.2719 1.8165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.234292 0.026605 -8.806 < 2e-16 ***
## Adult.Mortality 0.252827 0.011330 22.314 < 2e-16 ***
## StatusDeveloped 0.285710 0.030692 9.309 < 2e-16 ***
## Alcohol 0.003227 0.011558 0.279 0.780106
## percentage.expenditure -0.093635 0.010004 -9.359 < 2e-16 ***
## Hepatitis.B 0.028586 0.009885 2.892 0.003859 **
## Measles 0.032791 0.009162 3.579 0.000351 ***
## BMI -0.067149 0.011234 -5.977 2.55e-09 ***
## Polio -0.065743 0.011906 -5.522 3.66e-08 ***
## Total.expenditure -0.023772 0.009391 -2.531 0.011413 *
## Diphtheria -0.091262 0.012498 -7.302 3.67e-13 ***
## HIV.AIDS 0.200055 0.010270 19.480 < 2e-16 ***
## Population -0.020474 0.009241 -2.215 0.026807 *
## thinness..1.19.years 0.056147 0.011323 4.959 7.52e-07 ***
## Income.composition.of.resources -0.150204 0.014989 -10.021 < 2e-16 ***
## Schooling -0.242062 0.015793 -15.327 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4599 on 2829 degrees of freedom
## Multiple R-squared: 0.7896, Adjusted R-squared: 0.7885
## F-statistic: 707.8 on 15 and 2829 DF, p-value: < 2.2e-16
We should preface this by saying that we don’t have to prove the four assumptions are “perfectly met”, but we need to see to which extent they are violated and see if we can get results that can be considered satisfying.
We start by checking the second condition and we do it by producing the Residuals vs Fitted plot.
For linearity to hold, the points should be evenly distributed on the two sides of the line, and the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.
In our case, there is no pattern in the residual plot. This means that we can assume linear relationship between the predictors and the outcome variables.
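In R, the Residuals vs Fitted panel comes directly from the fitted `lm` object. A minimal sketch, using the built-in `mtcars` data as a stand-in since the project's data frame is not reproduced in this excerpt:

```r
model <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for the full model

# which = 1 selects the Residuals vs Fitted panel: we want a roughly
# horizontal red line at zero with no visible pattern in the points.
plot(model, which = 1)
```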
Let’s now check the third condition.
Looking at the Q-Q plot above, most of the points lie on the diagonal line, except at the extremes, where they deviate from it; the Q-Q plot is therefore inconclusive regarding the normality of the residuals. We need another way to check whether the normality condition is met, so we plot a histogram of the residuals.
The histogram shows that most of the residuals fall around zero and that the tails (the extremes) contain few observations. We can conclude that the residuals of our regression model are approximately normally distributed.
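Both normality checks are one-liners in base R. A sketch on the same stand-in model (`mtcars` in place of the life-expectancy data):

```r
model <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for the full model

# Normal Q-Q plot of the standardized residuals: points close to the
# diagonal reference line suggest approximately normal residuals.
qqnorm(rstandard(model))
qqline(rstandard(model))

# Histogram of the residuals: mass concentrated around zero, thin tails.
hist(residuals(model), breaks = 20, main = "Residuals", xlab = "Residual")
```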
Let’s now check the last condition: homoscedasticity. To test this assumption, we can use the Breusch-Pagan test.
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 188.95, df = 15, p-value < 2.2e-16
As we can see, p < 0.05, so there is evidence that the homoscedasticity assumption is not fulfilled. When homoscedasticity is violated, the model’s estimate of the mean remains good, but its confidence intervals do not.
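The header of the output above ("studentized Breusch-Pagan test") matches `lmtest::bptest(model)`. The same studentized (Koenker) statistic can also be computed in base R, which makes the mechanics explicit; a sketch using `mtcars` as a stand-in:

```r
model <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for the full model

# Auxiliary regression: squared residuals on the original predictors.
aux <- lm(residuals(model)^2 ~ wt + hp, data = mtcars)

n  <- nrow(mtcars)
bp <- n * summary(aux)$r.squared      # LM statistic: n times the auxiliary R^2
df <- length(coef(aux)) - 1           # degrees of freedom = number of predictors
p  <- pchisq(bp, df, lower.tail = FALSE)

c(BP = bp, df = df, p.value = p)
```

A small p-value here, as in the report's output, indicates heteroscedastic residuals.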
We can also gauge the extent of the violation by looking at the Scale-Location plot.
This plot shows whether the residuals are spread equally along the range of fitted values; ideally, the line is close to horizontal with equally spread points, which is roughly what we observe. This means that homoscedasticity is violated, but not to a large extent.
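The Scale-Location panel is also produced from the fitted `lm` object; a sketch on the stand-in model:

```r
model <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for the full model

# which = 3 selects the Scale-Location panel: sqrt(|standardized residuals|)
# against fitted values; a flat line indicates roughly constant variance.
plot(model, which = 3)
```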
## lag Autocorrelation D-W Statistic p-value
## 1 0.6455986 0.7066183 0
## Alternative hypothesis: rho != 0
From the Durbin-Watson output above, the test statistic is 0.7066 and the corresponding p-value is 0. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the residuals are autocorrelated. Moreover, the D-W statistic of approximately 0.7 is close to 0, indicating strong positive autocorrelation.
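The output format above matches `car::durbinWatsonTest(model)`, which obtains its p-value by simulation. The statistic itself, and the first-lag autocorrelation it summarizes, can be computed directly in base R; a sketch on the stand-in model:

```r
model <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for the full model
e <- residuals(model)

# Durbin-Watson statistic: values near 2 indicate no first-order
# autocorrelation; values near 0 indicate strong positive autocorrelation.
dw <- sum(diff(e)^2) / sum(e^2)

# First-lag autocorrelation of the residuals (approximately 1 - dw/2).
r1 <- cor(e[-1], e[-length(e)])

c(DW = dw, lag1.autocorrelation = r1)
```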
In the next section, we will focus on feature selection, where we aim to identify the optimal set of variables for our analysis.
Feature selection is a crucial step in constructing models, as it involves identifying a subset of relevant features from a larger dataset. In this report, we explore variable selection on the life expectancy dataset using methods such as best subset, forward inclusion, and backward elimination. Additionally, we use four metrics, namely residual sum of squares (rss), adjusted R^2 (adjr2), Mallow’s Cp (cp), and the Bayesian information criterion (bic), to determine the most relevant subset of variables. By examining the differences in the number of selected features across these criteria, we can determine which criterion yields a more parsimonious model. We aim to identify the criterion that best aligns with our goal of selecting a subset of features that captures the essential information while minimizing redundancy. Ultimately, this enables us to construct a robust and interpretable model for predicting life expectancy.
We will use 3 regression subset methods to come up with the most relevant subset of features namely:
Forward Selection: We begin with an empty model and iteratively add the most significant feature based on the chosen criterion (BIC, Cp, or adjusted R^2) until a stopping condition is met.
Backward Selection: We start with all features and iteratively remove the least significant feature based on the selected criterion until a stopping condition is met.
Mixed Selection: This method combines forward and backward selection, iteratively adding and removing features based on the chosen criterion until a stopping condition is satisfied.
The different metrics used to evaluate the best subsets of variables provided varying recommendations:
RSS: 16 variables
adjr2: 16 variables
Cp: 15 variables
BIC: 13 variables.
Based on these metrics, we have decided to prioritize the BIC as the criterion for selecting the best subset. This is because the BIC tends to penalize models with a larger number of variables more heavily. As a result, the BIC generally favors smaller models compared to Cp and AIC. In this analysis, we specifically exclude the AIC criterion since Cp and AIC yield equivalent results in terms of selecting the same model. Therefore, by considering the BIC metric, we can achieve a balance between model complexity and goodness of fit, resulting in the selection of a subset with a minimum number of variables, as anticipated.
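Exhaustive best-subset search with the rss/adjr2/cp/bic criteria is typically done with `leaps::regsubsets`; the forward, backward, and mixed procedures can also be sketched in base R with `step()`, where setting `k = log(n)` replaces the default AIC penalty with BIC. A stand-in example on `mtcars` (the report would run the same procedure with `Life.expectancy` as the response):

```r
# Stand-in full model; in the report this would be the full set of predictors.
model_full <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)
n <- nrow(mtcars)

# Backward elimination under BIC (k = log(n)).
bic_back <- step(model_full, direction = "backward", k = log(n), trace = 0)

# Forward selection from the empty model over the same scope.
bic_fwd <- step(lm(mpg ~ 1, data = mtcars), direction = "forward",
                scope = formula(model_full), k = log(n), trace = 0)

# Mixed (stepwise) selection, adding and removing terms at each step.
bic_both <- step(model_full, direction = "both", k = log(n), trace = 0)

length(coef(bic_back)) - 1  # number of predictors retained
```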
We generate visualizations to compare the performance of different subsets based on the R-squared, adjusted R-squared, Cp, and BIC metrics. Each plot shows the number of predictors on the x-axis and the respective metric on the y-axis. A second set of plots shows the metric values for different combinations of predictor variables in a multivariate regression model: each row contains black boxes (the variable is used in the model) and white boxes (the variable is not used) and represents a separate multivariate model with its own metric value.
Conclusion: Through the application of regression subset methods, we successfully perform variable selection and evaluate the model performance using multiple metrics. The plots assist in understanding the impact of different subsets on model fit, while the coefficient analysis sheds light on the significance of predictors in explaining life expectancy. These findings contribute to the development of a robust and interpretable model for predicting life expectancy based on the available dataset.
## Number of Optimal Features by forward selection: 13
## Number of Optimal Features by backward selection: 13
## Number of Optimal Features by mixed techniques: 13
The feature selection analysis, incorporating forward, backward, and mixed techniques, consistently identifies 13 features as the optimal subset for building a predictive model. These selected features showcase their importance in accurately predicting the target variable, enhancing the model’s performance and interpretability. By focusing on these 13 features, we can create a streamlined and efficient model that avoids unnecessary complexity and reduces the risk of overfitting. Additionally, we examine the selected variables and their coefficients for each method to gain further insights into the model’s behavior.
Overall, when the forward, backward, and mixed methods yield the same coefficients for specific variables, it strengthens the evidence for the importance and reliability of those variables in predicting the target variable. It provides consistency, stability, and confidence in the selected features, which are crucial for building effective and interpretable regression models.
Initially, we apply a simple linear model to the dataset using selected features. We then compare the results obtained from this model with those of a linear model trained on the entire set of features. Subsequently, we explore the use of ridge regression and lasso regression models, incorporating shrinkage parameters, to observe any differences in the outcomes.
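In practice, both shrinkage models are usually fitted with a package such as `glmnet` (`alpha = 0` for ridge, `alpha = 1` for lasso, with the penalty chosen by `cv.glmnet`); that is presumably what produced the results below. Ridge regression also has a closed form, which can be sketched in base R on stand-in data to show what the shrinkage parameter does:

```r
# Standardize predictors and center the response: ridge is scale-sensitive.
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Ridge closed form: beta(lambda) = (X'X + lambda * I)^{-1} X'y.
ridge_coef <- function(lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

ridge_coef(0)   # lambda = 0 reproduces ordinary least squares
ridge_coef(10)  # a larger lambda shrinks the coefficients toward zero
```

The lasso has no closed form (its L1 penalty requires an iterative solver), which is why a dedicated package is used for it.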
## RMSE: 0.4600969
##
## Call:
## lm(formula = Life.expectancy ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3795 -0.2376 0.0182 0.2688 1.7982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.266538 0.029828 -8.936 < 2e-16 ***
## Adult.Mortality 0.256353 0.013825 18.542 < 2e-16 ***
## Hepatitis.B 0.032883 0.011899 2.764 0.005771 **
## BMI -0.070112 0.013442 -5.216 2.02e-07 ***
## Diphtheria -0.085438 0.014970 -5.707 1.32e-08 ***
## HIV.AIDS 0.195127 0.012229 15.957 < 2e-16 ***
## Income.composition.of.resources -0.137560 0.017278 -7.961 2.84e-15 ***
## Measles 0.040019 0.011324 3.534 0.000419 ***
## percentage.expenditure -0.081826 0.011594 -7.058 2.33e-12 ***
## Polio -0.065741 0.013873 -4.739 2.30e-06 ***
## Schooling -0.248588 0.018170 -13.681 < 2e-16 ***
## StatusDeveloped 0.326859 0.034164 9.567 < 2e-16 ***
## thinness..1.19.years 0.049813 0.013004 3.831 0.000132 ***
## Total.expenditure -0.008764 0.011094 -0.790 0.429643
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4612 on 1981 degrees of freedom
## Multiple R-squared: 0.7877, Adjusted R-squared: 0.7864
## F-statistic: 565.5 on 13 and 1981 DF, p-value: < 2.2e-16
## RMSE 0.4601041
##
## Call:
## lm(formula = Life.expectancy ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.33948 -0.23328 0.02395 0.27052 1.77515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2404638 0.0320903 -7.493 1.01e-13 ***
## Adult.Mortality 0.2554388 0.0138661 18.422 < 2e-16 ***
## infant.deaths 0.0022790 0.0140311 0.162 0.870988
## Alcohol -0.0139939 0.0138083 -1.013 0.310974
## percentage.expenditure -0.0859182 0.0116764 -7.358 2.72e-13 ***
## Hepatitis.B 0.0312245 0.0120290 2.596 0.009508 **
## Measles 0.0420960 0.0126921 3.317 0.000927 ***
## BMI -0.0677261 0.0134526 -5.034 5.23e-07 ***
## Polio -0.0663533 0.0138561 -4.789 1.80e-06 ***
## Total.expenditure 0.0006514 0.0114387 0.057 0.954590
## Diphtheria -0.0831358 0.0149885 -5.547 3.30e-08 ***
## HIV.AIDS 0.1954656 0.0122778 15.920 < 2e-16 ***
## GDP -0.0338749 0.0128747 -2.631 0.008576 **
## Population -0.0095069 0.0143559 -0.662 0.507901
## thinness..1.19.years 0.0486770 0.0144623 3.366 0.000778 ***
## Income.composition.of.resources -0.1385853 0.0172701 -8.025 1.73e-15 ***
## Schooling -0.2496408 0.0185953 -13.425 < 2e-16 ***
## StatusDeveloped 0.2944134 0.0371639 7.922 3.87e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4603 on 1977 degrees of freedom
## Multiple R-squared: 0.789, Adjusted R-squared: 0.7872
## F-statistic: 434.9 on 17 and 1977 DF, p-value: < 2.2e-16
| Model | R-squared | Adjusted R-squared | Mean Squared Error |
|---|---|---|---|
| Reduced Model | 0.7877 | 0.7864 | 0.4600969 |
| Full Model | 0.789 | 0.7872 | 0.4601041 |
##
## F test to compare two variances
##
## data: (y_test - predictions) and (predictionsEDA - y_testEDA)
## F = 1.0265, num df = 849, denom df = 838, p-value = 0.7041
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.8968019 1.1749280
## sample estimates:
## ratio of variances
## 1.026519
The F-test was conducted to compare the variances of the reduced model and the full model. The test yielded an F-statistic of 1.026519, with a p-value of 0.7041. The results indicate that there is no significant difference in variances between the two models. Therefore, we conclude that the differences in variance observed are likely due to random variation.
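This comparison uses base R's `var.test()` on the two models' prediction errors. A sketch with synthetic stand-in error vectors (the actual test-set residuals are not reproduced in this excerpt):

```r
set.seed(1)
# Stand-in prediction-error vectors with sample sizes matching the output above.
err_reduced <- rnorm(850, sd = 0.46)
err_full    <- rnorm(839, sd = 0.46)

# Two-sided F test for equality of variances.
ft <- var.test(err_reduced, err_full)
ft
```

A p-value above 0.05, as obtained in the report, means the ratio of variances is not significantly different from 1.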
| Model | Mean Squared Error | R-squared | Adjusted R-squared |
|---|---|---|---|
| Reduced Model | 0.214421 | 0.7877608 | 0.7844604 |
| Full Model | 0.2100862 | 0.7958114 | 0.7915834 |
##
## F test to compare two variances
##
## data: (y_test - predictions) and (predictionsEDA - y_testEDA)
## F = 1.0331, num df = 849, denom df = 838, p-value = 0.6365
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9025581 1.1824694
## sample estimates:
## ratio of variances
## 1.033108
The F-test was conducted to compare the variances of the reduced model and the full model. The test yielded an F-statistic of 1.0331, with a p-value of 0.6365. The results indicate that there is no significant difference in variances between the two models. Therefore, we conclude that the differences in variance observed are likely due to random variation.
| Model | Mean Squared Error | R-squared | Adjusted R-squared |
|---|---|---|---|
| Reduced Model | 0.2139883 | 0.7881891 | 0.7848954 |
| Full Model | 0.2082843 | 0.7975627 | 0.793371 |
| Model | Type | Degree | Mean Squared Error | R-squared | Adjusted R-squared | p-value |
|---|---|---|---|---|---|---|
| Polynomial | Reduced Model | 3 | 0.4036827 | 0.8529 | 0.8501 | 2.2e-16 |
| Polynomial | Full Model | 3 | 0.4036827 | 0.8529 | 0.8501 | 2.2e-16 |
| Polynomial | Reduced Model | mixed | 0.4112296 | 0.8471 | 0.8454 | 2.2e-16 |
| Polynomial | Full Model | mixed | 0.4112296 | 0.8471 | 0.8454 | 2.2e-16 |
| Model | Type | Mean Squared Error | R-squared | Adjusted R-squared |
|---|---|---|---|---|
| Lasso Regression | Reduced Model | 0.2139883 | 0.7881891 | 0.7848954 |
| Lasso Regression | Full Model | 0.2082843 | 0.7975627 | 0.793371 |
| Ridge Regression | Reduced Model | 0.214421 | 0.7877608 | 0.7844604 |
| Ridge Regression | Full Model | 0.2100862 | 0.7958114 | 0.7915834 |
| Simple Linear Model | Reduced Model | 0.4600969 | 0.7877 | 0.7864 |
| Simple Linear Model | Full Model | 0.4601041 | 0.789 | 0.7872 |
| Polynomial D=3. | Reduced Model | 0.4036827 | 0.8529 | 0.8501 |
| Polynomial D=3. | Full Model | 0.4036827 | 0.8529 | 0.8501 |
| Polynomial MIXED | Reduced Model | 0.4112296 | 0.8471 | 0.8454 |
| Polynomial MIXED | Full Model | 0.4112296 | 0.8471 | 0.8454 |
Based on these metrics, we can see that the Lasso Regression model has a slightly lower MSE and a slightly higher R-squared value compared to the Ridge Regression model. This suggests that the Lasso Regression model performs slightly better in terms of prediction accuracy and explaining the variance in the target variable.
However, it’s important to note that the difference between the models is relatively small. Further analysis, such as cross-validation or hypothesis testing, could provide additional insights into the statistical significance and stability of the model performance.
In addition, the full models (both Lasso and Ridge Regression) tend to perform slightly better than the reduced models. They have lower MSE values and higher R-squared and adjusted R-squared values, indicating better overall performance in terms of prediction accuracy and capturing the variance in the target variable. We can conclude that the feature-selection model still provides good results given that it decreases the number of features, though it is important to consider other factors, such as model complexity and interpretability, when selecting the best model for a particular application.
The polynomial models with degree 3 (both reduced and full) have similar MSE values, indicating a good fit to the data. The R-squared and adjusted R-squared values for these models are relatively high, suggesting a good explanation of the variance in the data. The polynomial mixed-degree models also perform well, although slightly worse in terms of MSE compared to the degree-3 models. The R-squared and adjusted R-squared values for the mixed-degree models are slightly lower than those of the degree-3 models but still indicate a reasonable fit. In conclusion, based on the given results, the Lasso Regression model on the whole dataset (Full Model) appears to be the best choice among the models considered for predicting the target variable.
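The degree-3 polynomial fits can be expressed in base R with `poly()`; a stand-in sketch on `mtcars`:

```r
# poly() builds orthogonal polynomial terms, avoiding the strong
# collinearity that raw powers (x, x^2, x^3) would introduce.
lin_model  <- lm(mpg ~ wt + hp, data = mtcars)
poly_model <- lm(mpg ~ poly(wt, 3) + poly(hp, 3), data = mtcars)

# The polynomial model nests the linear one, so its R-squared cannot be
# lower; adjusted R-squared shows whether the extra terms pay for themselves.
summary(lin_model)$adj.r.squared
summary(poly_model)$adj.r.squared
```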
Then we can say that the main factors that affect life expectancy are:
We also came to the following conclusions:
Education has a significant impact on life expectancy
Life expectancy has a negative relationship with alcohol consumption
Immunization against Hepatitis B and Diphtheria positively impacts life expectancy
Countries with higher income composition of resources for human development have a better life expectancy.
There is no significant difference in proportions of the number of infant deaths and the number of under-five deaths.
There is no strong correlation between alcohol consumption and life expectancy
The most frequent range for life expectancy is 65–82 years; the least frequent ranges are below 45 years and above 85 years.
Immunization coverage has a significant impact on life expectancy
Population does not have a large impact on life expectancy